%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Programming HASH Tables                          by  David Tiberio %%
%%                                     dtiberio@libserv1.ic.sunysb.edu %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

One of the most important data structures used in programming is the HASH
table. This system has many advantages; it is fast, the data is easy to
organize, and it is easy to program. One of the drawbacks is that it is
not as dynamic as some data structures, although it can be combined with
other data structures to make it more dynamic. Another drawback is that it
may be hard to find a good hashing function that applies equally well to
all of your data.

A hash table is an array of fixed size. Each location of the array can
hold one piece of data or many pieces of data, including other hash tables
or linked lists. An example of a hash table is the telephone system. In
this example, I will be using the American telephone system.

Each region has its own three digit number. For example, Long Island uses
the 516 area code. Queens (one part of New York City) uses the 718 area
code. Manhattan, also a part of New York City, uses 212. Other parts of
New York City use different area codes.
Next there are local calling exchanges. East Setauket uses numerous
exchanges, such as 476 and 473. Stony Brook University uses 632 and 444,
although other exchanges are used for the Stony Brook area. Last of all,
each house can have its own 4 digit code. For example, my house has 4
lines coming into it. One is 5156, another 6351, another 3516, and another
1615. Now, we take these three groups of numbers and put them together to
get a hashed phone number. To reach my house, you could call 516-476-6351,
516-473-5156, 516-476-3516, etc. Note that none of these numbers ring my
personal phone BTW :). Some of these numbers used to be here but were
disconnected (such as my old BBS number). Now, anyone in North America
would dial one of these numbers, which would be hashed through the system.
First to Long Island, then to East Setauket, then to my house.


Here is another example. You have 10 people, and you want to store them in
an array with 5 spaces. Here are the people:

    David Pleasance
    Irving Gould
    Jay Miner
    Jim Drew
    John DiLullo
    Lewis Eggebrecht
    Marc Barrett
    Mehdi Ali
    RJ Michaels
    Rob Niles

Here is how the information is stored:

1

2   RJ

3   Jay, Jim, Rob

4   John, Marc

5   David, Lewis, Mehdi

6   Irving

What about each person places them in the given array location? The HASH
function in this case is very simple. The length of the first name was
used, placing RJ all by himself in the 2 letter location with Irving in the
6 letter bucket (a bucket is a term used ot refer to an array location).
Now, in the cases were more than one person is in a bucket, a link is made
between each name starting at the first name placed in the bucket. For
example, to find Mehdi, you would look in bucket 5 since Mehdi is 5 letters
long. First you find David, then you check to the next name on the list to
find Lewis, and then the next name you find is Mehdi. After Mehdi, you
would place a NULL pointer, as with the last element of each bucket. The
1st bucket merely contains a NULL pointer.


There is more than just one way to make a HASH function. Here are some to
try:

    AMIGA: Add the ASCII characters for each letter together,
           then do a MOD (MOD is a remainder function) on the size of the
           array. In C, this might look like:

           bucket = ('A' + 'M' + 'I' + 'G' + 'A') % array_size;

    RULES: Let's try something different with this word (which I chose as
           an example for no particular reason). Instead of adding the
           ASCII values, we could try using RAW keycodes. The RAW keycodes
           are determined by the location of the keys on the keyboard,
           as opposed to alphabetically. For example, a DVORAK keyboard
           would have different RAW keycodes than a QWERTY keyboard.
           The % (MOD function) is used to ensure that if the total is
           greater than the size of the array, then it will start again
           from the beginning of the array until it fits.

           bucket = (19 + 22 + 40 + 18 + 33) % array_size;


There are some key issues to remember. First, the hash function should
tell the difference between the position of letters. For example, the
above sample with AMIGA would have the same bucket as AGIMA or MAGIA,
since they contain the same ASCII values. The total HASH value should
not exceed the size of the array (hence the MOD function is used). You
may place more than one item in a bucket only if you link them with a
link list or use a second hash function to divide the bucket into a
second array. Or, if there is a collision (an item is already in the
bucket), you can check the next bucket in the array until you find a
free one. If you use the second method, the size of the array must be
equal to or greater than the total number of data items you are working
with.

- USENET REFERENCES -

    Please see comp.sys.amiga.programmer for more information about using
HASH tables or linked lists.


As Edward Cayce might say, that is all for the present...

dtiberio@libserv1.ic.sunysb.edu



converted with guide2html by Kochtopf